Abstract- Data mining is an interdisciplinary research field.
Association rule mining plays a vital role in data mining for
finding significant relations in biological data. Microarray
technology is widely used by researchers to find meaningful
relations among gene expression data. In this research paper,
the statistical t-test is applied to select the significant
genes, the equal frequency binning method is used to discretize
the gene expression data, Boolean Association Rules (BAR)
generate the frequent gene expression intervals and, finally,
the association rules are discovered. The association rules
expose significant relations among microarray gene expression
data and can be used to support decisions in cancer diagnosis.

Keywords- Microarray, Gene Filtering, Equal Frequency, 
Frequent Pattern Mining and Association Rule Mining. 


1. INTRODUCTION 

Nowadays, huge amounts of biological data are being collected.
Analyzing and extracting information from such large amounts of
data is difficult. Data mining techniques are used to extract
useful knowledge from these data. The methodology proposed in
this paper uses association rule mining to extract interesting
relationships among sets of genes in the field of
bioinformatics.

Microarray technologies provide the opportunity to measure the
expression levels of tens of thousands of genes in cells
simultaneously. One interesting fact about microarray data is
that the behavior of thousands of genes can be examined at
different times. Gene expression is the process of transcribing
a DNA sequence into an mRNA sequence, which is later translated
into the amino acid sequence known as a protein. The number of
RNA copies produced is called the gene expression level.
Microarray experiments contain a huge amount of data, and the
main challenge with microarray data is its high density. Data
collected from microarray experiments are in the form of an
R x C matrix of expression levels, where R represents rows
(experiments) and C represents columns (genes). A microarray
contains an order of magnitude more genes than experiments.
This paper focuses on microarray gene expression interval
association analysis based on frequent pattern mining. Frequent
pattern mining is the most important task of association rule
mining. Microarray gene expression interval association
analysis explores the biologically relevant associations
between different genes under different experimental samples.

The rest of the paper is organized as follows. The related
papers are reviewed in Section 2. The proposed methodology of
Boolean Association Rules (BAR) is illustrated in Section 3.
The experimental results are shown in Section 4. The conclusion
of the work is discussed in Section 5.

2. RELATED WORKS 

Various algorithms have been studied for this survey. To
extract interesting relationships among sets of genes using
gene intervals and association rules, the researcher needs a
basic knowledge of gene filters, discretization techniques and
association algorithms.

Jeanmougin, M., et al. have discussed statistical approaches
for selecting genes differentially expressed between two
groups. They applied the t-test and compared it with various
other statistical methods for finding the significant genes [1].

Garcia, S., et al. have made a survey of discretization
techniques. Discretization is an essential preprocessing
technique that transforms a set of continuous attributes into
discrete attributes by associating categorical values with
intervals [2].

Alves, R., et al. have discussed frequent pattern methods for
gene association analysis. Dense datasets such as
telecommunications and microarray data contain many long
frequent patterns; hence, these methods scale very poorly and
are sometimes impractical. This drawback is due to the high
computational cost of the Apriori algorithm. They also pointed
out that tree-based methods such as Frequent Pattern growth
(FP-growth) may encounter difficulties when dealing with
high-dimensional datasets [3].

Zakaria, W., et al. have proposed a column enumeration based
algorithm using high confidence association rules for up- and
down-expressed genes. They explained that generating all
frequent itemsets in dense datasets requires large memory [4].

Alagukumar, S., et al. have discussed microarray data analysis
using association rule mining. They compared the frequent
pattern mining methods Apriori and FP-Growth on microarray gene
expression data [5].

Wur, S.Y., et al. have proposed an effective Boolean algorithm
for mining association rules in large databases. Its sparse
matrix approach gives better performance than the Apriori
algorithm [6].

From the literature study, it is concluded that microarray
datasets typically contain a high density of data. Association
rules have proved to be useful in analyzing such datasets.

However, most existing association rule mining algorithms are
unable to handle normalized microarray datasets with continuous
values efficiently.

The existing association rule mining algorithms require large
memory and take exponential time for generating frequent gene
expression patterns and discovering association rules. In this
paper, a new algorithm called Boolean Association Rule (BAR) is
described. It is specially designed to select the significant
genes, generate frequent gene expression intervals and discover
association rules from microarray gene expression data using
gene intervals, with less memory and low computational time.

3. METHODOLOGY 

Association rule mining finds frequent item-sets whose
occurrences exceed a predefined threshold in the dataset. It
then generates association rules from the frequent item-sets
using the support and confidence measures. Here, association
rule mining is applied to a microarray dataset to extract
interesting associations among sets of genes.



[Figure 1 flowchart: Read the gene expression data ->
Preprocessing (gene filtering using t-test; discretization
using equal frequency binning method; convert the discretized
data to transaction data) -> Gene Association Analysis
(generate frequent items using Boolean method; extract the
association rules) -> End]

Fig. 1: Block diagram of BAR


In Boolean Association Rule (BAR), item-sets are gene
expression intervals. The aim of BAR is to extract the frequent
gene expression intervals and then use them to generate
association rules. Before mining, BAR selects the significant
genes from the microarray gene expression data and transforms
the continuous gene expression data into discretized gene
expression data. Finally, the discretized data are used as
transaction data for mining.

This research paper proposes the Boolean Association Rule (BAR)
algorithm for microarray gene association analysis using
frequent gene expression intervals and association rules, as
shown in Figure 1. BAR comprises two phases, namely
preprocessing and gene association analysis. The pseudo code
for the overall algorithm is given in Figure 2.

Input  : Microarray gene expression data
Output : Microarray gene association rules

Begin
  Step 1 : Read the gene expression data
  Step 2 : Filter significant genes using t-test
  Step 3 : Discretize the gene expression data using equal
           frequency binning method
  Step 4 : Convert the discretized gene intervals data to
           transaction data
  Step 5 : Generate the frequent gene intervals using
           Boolean method
  Step 6 : Extract the microarray gene association rules
           using Boolean method
End

Fig. 2: BAR Algorithm

At the end of the two phases, the frequent patterns and the
significant relations among microarray gene expression
intervals have been extracted.

A. Preprocessing 

Data preprocessing is a data mining technique that transforms
unprocessed data into an understandable format. Real-world data
are often incomplete and inconsistent, and data preprocessing
is a proven method of resolving such issues. In this paper, the
informative genes are selected using gene filtering and the
continuous gene expression data are transformed into discrete
data using a discretization technique.

1. Gene Filtering using t-test 

Gene filtering is the process of selecting the genes that are
differentially expressed and statistically significant in the
gene expression data, here using the t-test. The t-test is the
method most often used to analyze microarray data. The
t-statistic provides a standardized estimate of differential
expression based on the following formula:


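$$T = \frac{\bar{A} - \bar{B}}{\sigma_p \sqrt{\frac{1}{n} + \frac{1}{m}}}$$

$$\sigma_p^2 = \frac{\sum_{i=1}^{n} A_i^2 - n\bar{A}^2 + \sum_{j=1}^{m} B_j^2 - m\bar{B}^2}{n + m - 2}$$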

where $\bar{A} - \bar{B}$ is the difference of the sample means
of the two groups, of sizes n and m, and $\sigma_p$ is the
pooled estimator of the standard deviation.

First, the sums of squares and the correction factor
$n\bar{A}^2$ are calculated; subtracting the correction factor
from the sum of squares gives n times the sample variance, also
called the sum of squared residuals. The associated probability
under the null hypothesis is obtained by reference to the
t-distribution with n + m - 2 degrees of freedom. The p-value
is used to determine whether the observed difference is
statistically significant. A p-value of 0.05 or less is
commonly considered statistically significant. Thus the
t-statistic is calculated for each gene and its p-value is
obtained from the t-distribution with N - 2 degrees of freedom,
where N = n + m is the total number of samples. Finally, the
genes that are differentially expressed and statistically
significant are selected at p < 0.05, and the uninformative
genes are removed from the gene expression data.
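
The pooled two-sample t-statistic described above can be
sketched in a few lines of Python. This is only an illustration
under stated assumptions, not the authors' Java implementation:
the function name pooled_t_statistic and the toy values are
hypothetical, and the sketch stops at the t-statistic and the
degrees of freedom. Turning t into a p-value needs the
t-distribution, which the Python standard library does not
provide; a statistics package would be required for that step.

```python
import math


def pooled_t_statistic(a, b):
    """Two-sample t-statistic with a pooled standard deviation.

    a, b: expression values of one gene in the two sample groups.
    Returns (t, degrees_of_freedom).
    """
    n, m = len(a), len(b)
    mean_a = sum(a) / n
    mean_b = sum(b) / m
    # Sum of squares minus the correction factor (n * mean^2)
    # gives the sum of squared residuals of each group.
    ss_a = sum(x * x for x in a) - n * mean_a ** 2
    ss_b = sum(x * x for x in b) - m * mean_b ** 2
    df = n + m - 2
    sigma_p = math.sqrt((ss_a + ss_b) / df)
    t = (mean_a - mean_b) / (sigma_p * math.sqrt(1.0 / n + 1.0 / m))
    return t, df


# Hypothetical toy values, not taken from the paper's dataset.
group_a = [1.0, 2.0, 3.0]
group_b = [4.0, 5.0, 6.0]
t_value, dof = pooled_t_statistic(group_a, group_b)
print(t_value, dof)
```

A gene would then be kept when the p-value corresponding to its
t-statistic and degrees of freedom is below 0.05.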

2. Discretization using Equal Frequency Binning 

Data discretization is a commonly used preprocessing method
that reduces the number of distinct values of a continuous
variable by dividing its range into a finite set of disjoint
intervals and then associating these intervals with meaningful
labels [7]. Subsequently, data are analyzed or reported at this
higher level of knowledge representation rather than as
individual values, which simplifies the data representation in
data exploration and data mining. Discretization methods can be
supervised or unsupervised. In this paper, the Equal Frequency
Binning method is used to discretize the gene expression
values.

The equal-frequency algorithm determines the minimum and
maximum values of the attribute to be discretized, sorts all
values in ascending order, and divides the sorted continuous
values into k intervals such that each interval contains
approximately n/k data instances with adjacent values [7].

$$\text{Instances per interval} = \frac{n}{k} \qquad (3)$$
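
As a minimal sketch of equal frequency binning, assuming plain
Python lists and hypothetical helper names
(equal_frequency_bins, assign_interval), the sorted values can
be cut into k chunks of about n/k instances each. Note one
simplification: the sketch reports the observed minimum and
maximum of each chunk as the interval bounds, whereas the
intervals shown in the paper's tables share a cut point between
neighbouring intervals.

```python
def equal_frequency_bins(values, k):
    """Split the sorted values into k bins of roughly n/k instances.

    Returns a list of (low, high) bounds, one pair per non-empty bin.
    """
    ordered = sorted(values)
    n = len(ordered)
    bins = []
    for i in range(k):
        start = (i * n) // k
        end = ((i + 1) * n) // k
        chunk = ordered[start:end]
        if chunk:
            bins.append((chunk[0], chunk[-1]))
    return bins


def assign_interval(value, bins):
    """Return the first interval whose bounds contain the value."""
    for low, high in bins:
        if low <= value <= high:
            return (low, high)
    raise ValueError("value outside all intervals")


# Hypothetical toy values, not taken from the paper's dataset.
values = [0.9, 0.1, 0.5, 0.3, 0.7, 1.1]
bins = equal_frequency_bins(values, 2)
print(bins)
print(assign_interval(0.3, bins))
```

With six values and k = 2, each bin receives three instances.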

B. Gene Association Analysis 

The Boolean Association Rule (BAR) algorithm finds useful
patterns and rules in transaction data using the Boolean
method. These patterns and rules are very useful for decision
making. The Boolean method generates the frequent microarray
gene intervals without generating candidate item-sets and
extracts the association rules, in two steps:


Step-1: The frequent gene interval sets are identified using
bitwise OR and bitwise AND operations.

Step-2: Microarray gene association rules are generated from
the frequent gene interval sets using bitwise AND and bitwise
XOR operations.

1. Frequent gene interval sets 

Given a set of genes $G = \{g_1, g_2, g_3, \ldots, g_n\}$ and a
set of samples $tID = \{s_1, s_2, s_3, \ldots, s_m\}$, a subset
$S \subseteq G$ is called frequent if
$\text{support}(S) \geq$ minimum support, where minimum support
is a user-defined threshold [8].
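
The idea of finding frequent gene interval sets with bitwise
operations can be illustrated with a small Python sketch. It is
not the authors' BAR implementation, whose details the paper
does not give: it is a simple bit-vector approach in which each
interval gets an integer whose bits mark the samples containing
it (set with bitwise OR), and the samples shared by several
intervals are obtained with bitwise AND. Unlike the method
described in the paper, it combines item-sets level by level.
The names TRANSACTIONS, build_bit_vectors and frequent_itemsets
are hypothetical; the data are the five samples of Table 4, and
a threshold test of "greater than or equal" is assumed.

```python
from itertools import combinations

# Transactions of the five samples listed in Table 4.
TRANSACTIONS = {
    "S1": ["LYPD6=[0.45,0.65]", "EST_2=[-0.56,-0.41]", "EST_3=[0.48,0.64]",
           "IL17BR=[-0.70,-0.59]", "IL1R2=[0.53,1.44]", "ABCC11=[4.35,6.55]"],
    "S2": ["LYPD6=[-1.49,0.45]", "EST_2=[-0.41,-0.38]", "EST_3=[0.48,0.64]",
           "IL17BR=[-3.56,-0.70]", "IL1R2=[1.44,1.65]", "ABCC11=[6.55,6.82]"],
    "S3": ["LYPD6=[-1.49,0.45]", "EST_2=[-0.56,-0.41]", "EST_3=[-1.24,0.48]",
           "IL17BR=[-3.56,-0.70]", "IL1R2=[0.53,1.44]", "ABCC11=[4.35,6.55]"],
    "S4": ["LYPD6=[0.45,0.65]", "EST_2=[-0.56,-0.41]", "EST_3=[-1.24,0.48]",
           "IL17BR=[-0.70,-0.59]", "IL1R2=[1.44,1.65]", "ABCC11=[6.55,6.82]"],
    "S5": ["LYPD6=[-1.49,0.45]", "EST_2=[-0.41,-0.38]", "EST_3=[-1.24,0.48]",
           "IL17BR=[-3.56,-0.70]", "IL1R2=[0.53,1.44]", "ABCC11=[4.35,6.55]"],
}


def popcount(x):
    """Number of set bits, i.e. number of samples in a bit vector."""
    return bin(x).count("1")


def build_bit_vectors(transactions):
    """Map each item to an integer whose bit i marks sample i."""
    vectors = {}
    for i, items in enumerate(transactions.values()):
        for item in items:
            # Bitwise OR switches on the bit of the current sample.
            vectors[item] = vectors.get(item, 0) | (1 << i)
    return vectors


def frequent_itemsets(transactions, min_support):
    """Return {frozenset of items: support count} for frequent sets."""
    n = len(transactions)
    vectors = build_bit_vectors(transactions)
    current = {frozenset([item]): vec for item, vec in vectors.items()
               if popcount(vec) / n >= min_support}
    frequent = {}
    while current:
        for itemset, vec in current.items():
            frequent[itemset] = popcount(vec)
        larger = {}
        for (s1, v1), (s2, v2) in combinations(current.items(), 2):
            union = s1 | s2
            if len(union) == len(s1) + 1 and union not in larger:
                # Bitwise AND keeps only the samples containing both sets.
                vec = v1 & v2
                if popcount(vec) / n >= min_support:
                    larger[union] = vec
        current = larger
    return frequent


FREQUENT = frequent_itemsets(TRANSACTIONS, 0.5)
for itemset, count in FREQUENT.items():
    print(sorted(itemset), count)
```

On these five samples with a minimum support of 50%, the sketch
finds six single intervals and two pairs, each with a support
count of 3, which agrees with Table 5.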

2. Microarray Gene Association Rules 

Association Rule: Let $G = \{g_1, g_2, g_3, \ldots, g_n\}$ be a
set of n elements called genes. A rule is defined as an
implication of the form $X \rightarrow Y$, where
$X, Y \subseteq G$ and $X \cap Y = \emptyset$ [8]. The
left-hand side of the rule is called the antecedent and the
right-hand side is called the consequent.

Support: The rule $X \rightarrow Y$ holds in the transaction
set T with support S, where S is the percentage of samples in T
that contain $X \cup Y$ [8]:

$$\text{Support}(X \rightarrow Y) = \frac{\text{Support}(X \cup Y)}{|T|}$$

where the numerator is the number of samples in T that contain
$X \cup Y$.

Confidence: The rule $X \rightarrow Y$ has confidence C in the
transaction set T, where C is the percentage of samples in T
containing X that also contain Y [8].
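
Support and confidence can be computed directly from bit
vectors. The sketch below is an illustration under assumptions,
not the paper's implementation: a frequent item-set is encoded
as a bit mask over item positions, antecedents are enumerated
as sub-masks with bitwise AND, and the consequent is obtained
with bitwise XOR. Confidence is taken as the support count of
$X \cup Y$ divided by the support count of X. The names ITEMS,
SAMPLE_VECTORS and rules_from_itemset are hypothetical; the two
intervals and their sample memberships (S1, S3, S5) come from
Table 4.

```python
# Two intervals from Table 4. Bit i of a sample vector is set when
# sample S(i+1) contains the interval (S1 is the lowest bit).
ITEMS = ["IL1R2=[0.53,1.44]", "ABCC11=[4.35,6.55]"]
SAMPLE_VECTORS = [0b10101, 0b10101]  # both occur in S1, S3 and S5
N_SAMPLES = 5


def popcount(x):
    """Number of set bits in x."""
    return bin(x).count("1")


def covered_samples(item_mask):
    """AND together the sample vectors of the items in item_mask."""
    vec = (1 << N_SAMPLES) - 1  # start with all samples selected
    for idx, sample_vec in enumerate(SAMPLE_VECTORS):
        if item_mask & (1 << idx):
            vec &= sample_vec
    return vec


def names(item_mask):
    """Translate an item bit mask back into item labels."""
    return [ITEMS[i] for i in range(len(ITEMS)) if (item_mask >> i) & 1]


def rules_from_itemset(itemset_mask, min_confidence):
    """List (antecedent, consequent, support, confidence) tuples."""
    rules = []
    count_xy = popcount(covered_samples(itemset_mask))
    # Enumerate every non-empty proper sub-mask as an antecedent.
    antecedent = (itemset_mask - 1) & itemset_mask
    while antecedent:
        # XOR removes the antecedent items, leaving the consequent.
        consequent = itemset_mask ^ antecedent
        confidence = count_xy / popcount(covered_samples(antecedent))
        if confidence >= min_confidence:
            rules.append((names(antecedent), names(consequent),
                          count_xy / N_SAMPLES, confidence))
        antecedent = (antecedent - 1) & itemset_mask
    return rules


RULES = rules_from_itemset(0b11, 1.0)
for rule in RULES:
    print(rule)
```

For this pair the sketch yields both directions of the rule
with support 0.6 and confidence 1.0, matching the last two rows
of Table 6.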

4. EXPERIMENTAL RESULTS 

The sample gene expression data, taken from the breast cancer2
dataset and consisting of 30 samples and 16 genes, are shown in
Table 1. The sample gene expression data are filtered using the
statistical t-test gene filtering method, and the filtered gene
expression data are shown in Table 2. After gene filtering, the
gene expression data are transformed into gene intervals using
the equal frequency binning discretization method, with each
gene's values divided into 2 distinct intervals, as shown in
Table 3.

Finally, the gene intervals are converted into transactional
data, where samples are represented by transactions and gene
intervals are represented by item-sets, as shown in Table 4. In
microarray gene association, the frequent gene interval sets
are generated with a minimum support of 50%, as shown in
Table 5.

From the frequent gene interval sets, the association rules are
extracted with support 50% and confidence 100%, as shown in
Table 6. Finally, biological knowledge is extracted from the
association rules. This knowledge can support gene-targeted
treatment decisions for cancer patients.


Table 1: Sample Microarray Gene Expression Data

Sample    S1      S2      S3      S4      S5      Sn
LYPD6     0.45    -1.49   -0.46   0.65    0.08
PTGER3    1.66    2.26    1.22    3.98    3.59
EST_1     -0.62   -0.68   -0.8    -0.68   -0.83
EST_2     -0.49   -0.38   -0.56   -0.48   -0.41
CHDH      0.86    1.17    0.34    1       -1.82
EST_3     0.64    0.48    -0.34   -0.04   -1.24
IL17BR    -0.7    -1.1    -2.16   -0.59   -3.56
SCYA4     7.13    7.11    7.05    6.51    8.58
IL1R2     1.37    1.65    0.53    1.44    1.1
ABCC11    5.96    6.55    4.35    6.82    6.02
HOXB13    -3.1    3.2     2.05    -3.33   1.12
APS       -1.45   0.68    -0.17   0.22    -0.45
ESTs_4    -2.57   2.3     1.11    -2.54   0.59
DOK2      2.04    1.69    1.77    3.09    2.24
EST_5     1.83    1.55    1.94    1.25    2.1
GUCY2D    1.99    4.25    5.49    1.56    2.59



Table 2: Filtered Gene Expression Data

Sample   LYPD6   EST_2   EST_3   IL17BR   IL1R2   ABCC11
S1       0.45    -0.49   0.64    -0.7     1.37    5.96
S2       -1.49   -0.38   0.48    -1.1     1.65    6.55
S3       -0.46   -0.56   -0.34   -2.16    0.53    4.35
S4       0.65    -0.48   -0.04   -0.59    1.44    6.82
S5       0.08    -0.41   -1.24   -3.56    1.1     6.02
Sn








Table 3: Microarray Gene Expression Intervals

Sample   LYPD6           EST_2            EST_3           IL17BR           IL1R2          ABCC11
S1       [0.45, 0.65]    [-0.56, -0.41]   [0.48, 0.64]    [-0.70, -0.59]   [0.53, 1.44]   [4.35, 6.55]
S2       [-1.49, 0.45]   [-0.41, -0.38]   [0.48, 0.64]    [-3.56, -0.70]   [1.44, 1.65]   [6.55, 6.82]
S3       [-1.49, 0.45]   [-0.56, -0.41]   [-1.24, 0.48]   [-3.56, -0.70]   [0.53, 1.44]   [4.35, 6.55]
S4       [0.45, 0.65]    [-0.56, -0.41]   [-1.24, 0.48]   [-0.70, -0.59]   [1.44, 1.65]   [6.55, 6.82]
S5       [-1.49, 0.45]   [-0.41, -0.38]   [-1.24, 0.48]   [-3.56, -0.70]   [0.53, 1.44]   [4.35, 6.55]
Sn








Table 4: Transaction dataset

tID   Itemset
S1    {LYPD6=[0.45,0.65], EST_2=[-0.56,-0.41],
       EST_3=[0.48,0.64], IL17BR=[-0.70,-0.59],
       IL1R2=[0.53,1.44], ABCC11=[4.35,6.55]}
S2    {LYPD6=[-1.49,0.45], EST_2=[-0.41,-0.38],
       EST_3=[0.48,0.64], IL17BR=[-3.56,-0.70],
       IL1R2=[1.44,1.65], ABCC11=[6.55,6.82]}
S3    {LYPD6=[-1.49,0.45], EST_2=[-0.56,-0.41],
       EST_3=[-1.24,0.48], IL17BR=[-3.56,-0.70],
       IL1R2=[0.53,1.44], ABCC11=[4.35,6.55]}
S4    {LYPD6=[0.45,0.65], EST_2=[-0.56,-0.41],
       EST_3=[-1.24,0.48], IL17BR=[-0.70,-0.59],
       IL1R2=[1.44,1.65], ABCC11=[6.55,6.82]}
S5    {LYPD6=[-1.49,0.45], EST_2=[-0.41,-0.38],
       EST_3=[-1.24,0.48], IL17BR=[-3.56,-0.70],
       IL1R2=[0.53,1.44], ABCC11=[4.35,6.55]}
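
The conversion of discretized intervals into transactions, as
illustrated in Table 4, could be sketched as follows. The
function name to_transactions and the nested dictionary layout
are assumptions made for illustration; only the item label
format "GENE=[low,high]" and the example intervals follow the
table.

```python
def to_transactions(discretized):
    """Turn {sample: {gene: (low, high)}} into {sample: [item, ...]}.

    Each item is a text label combining the gene name and its interval.
    """
    transactions = {}
    for sample, genes in discretized.items():
        transactions[sample] = [
            "%s=[%.2f,%.2f]" % (gene, low, high)
            for gene, (low, high) in genes.items()
        ]
    return transactions


# Two genes of the first two samples, with intervals as in Table 4.
discretized = {
    "S1": {"LYPD6": (0.45, 0.65), "EST_2": (-0.56, -0.41)},
    "S2": {"LYPD6": (-1.49, 0.45), "EST_2": (-0.41, -0.38)},
}
transactions = to_transactions(discretized)
print(transactions["S1"])
```

Treating each gene-interval pair as one item is what allows
standard item-set mining to be applied to continuous expression
data.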


Table 5: Frequent Gene Expression Intervals set

Frequent Gene Expression Intervals set           Support Count
LYPD6=[-1.49,0.45]                               3
EST_2=[-0.56,-0.41]                              3
EST_3=[-1.24,0.48]                               3
IL17BR=[-3.56,-0.70]                             3
IL1R2=[0.53,1.44]                                3
ABCC11=[4.35,6.55]                               3
LYPD6=[-1.49,0.45], IL17BR=[-3.56,-0.70]         3
IL1R2=[0.53,1.44], ABCC11=[4.35,6.55]            3




Table 6: Association Rules

Antecedent                                   Consequent                                   Sup.   Conf.
EST_3=[-1.24,0.48], ABCC11=[4.35,6.55]       IL17BR=[-1.83,-0.59], IL1R2=[0.53,1.44]      50%    100%
IL17BR=[-1.83,-0.59], IL1R2=[0.53,1.44]      EST_3=[-1.24,0.48]                           50%    100%
IL17BR=[-1.83,-0.59], IL1R2=[0.53,1.44]      ABCC11=[4.35,6.55]                           50%    100%
IL17BR=[-1.83,-0.59], IL1R2=[0.53,1.44]      EST_3=[-1.24,0.48], ABCC11=[4.35,6.55]       50%    100%
IL17BR=[-1.83,-0.59], ABCC11=[4.35,6.55]     EST_3=[-1.24,0.48]                           50%    100%
IL1R2=[0.53,1.44]                            ABCC11=[4.35,6.55]                           60%    100%
ABCC11=[4.35,6.55]                           IL1R2=[0.53,1.44]                            60%    100%


The experiments were performed on a computer with an Intel
Core 2 Duo CPU and 2 GB of main memory. The proposed algorithm
was implemented in Java with JDK 1.4. The microarray breast
cancer2 gene expression data were taken from the National
Center for Biotechnology Information (NCBI) [9].

A. Comparative analysis

Fig. 3: Comparative Analysis of Rule Generation

The proposed BAR algorithm was compared with the classical
association algorithms Apriori and FP-Growth. The proposed
algorithm discovers fewer rules and reduces the time complexity
as well as the memory usage compared with the benchmark
algorithms. Figure 3 depicts the comparative analysis of rule
generation for the proposed algorithm and the Apriori and
FP-Growth algorithms.


Procedia Computer Science, no. 47, pp. 3-12, 2015.
http://dx.doi.org/10.1016/j.procs.2015.03.177

[6] S.Y. Wur and Y. Leu, “An Effective Boolean Algorithm
for Mining Association Rules in Large Databases”,
Database Systems for Advanced Applications, IEEE
Transactions, pp. 179-186, 1999.

[7] R. Dash and R.L. Paramguru, “Comparative analysis of
Supervised and Unsupervised Discretization Techniques”,
International Journal of Advances in Science and
Technology, vol. 2, no. 3, pp. 29-7, 2011.

[8] J. Han and M. Kamber, “Data Mining: Concepts and
Techniques”, Morgan Kaufmann Publishers, Elsevier,
2002.

[9] www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE1379


6. CONCLUSION 

The proposed Boolean Association Rule (BAR) algorithm obtains
frequent item-sets without candidate generation and scans the
database only once, which reduces the time complexity and the
memory usage. The algorithm extracts significant relations
among microarray genes. The experiments were carried out using
the microarray breast cancer2 dataset. Additionally, the
algorithm was compared with the traditional benchmark
algorithms Apriori and FP-growth. The Apriori algorithm
requires large memory and takes exponential time for candidate
generation. FP-growth generates frequent item-sets without
candidate item-set generation; hence it requires less memory
and scans the database only twice. The comparative analysis
showed that BAR performed better than these methods. The
results of this work can be used as a resource for studying
diseases and for supporting gene-targeted treatments.